An Adaptive Model for Optimizing Performance of an Incremental Web
Crawler
Jenny Edwards 1,2, Kevin McCurley 3, John Tomlin 3
2 Faculty of Information Technology, University of Technology, Sydney, PO Box 123, Broadway NSW 2007, Australia
3 IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose CA 95120-6099, USA
jenny@it.uts.edu.au, {mccurley,tomlin}@almaden.ibm.com
Copyright is held by the author/owner.
WWW10, May 1-5, 2001, Hong Kong.
ACM 1-58113-348-0/01/0005.
Abstract
This paper outlines the design of a web crawler implemented for
IBM Almaden's WebFountain project and describes an optimization model for
controlling the crawl strategy. This crawler is scalable and incremental. The
model makes no assumptions about the statistical behaviour of web page changes,
but rather uses an adaptive approach to maintain data on actual change rates
which are in turn used as inputs for the optimization. Computational results
with simulated but realistic data show that there is no "magic bullet" -
different, but equally plausible, objectives lead to conflicting "optimal"
strategies. However, we find that there are compromise objectives which lead to
good strategies that are robust against a number of criteria.
Keywords: Crawler, Incremental Crawler, Scalability, Optimization
1 Introduction
Web
crawlers are an essential component of all search engines, and are increasingly
becoming important in data mining and other indexing applications, but they must
function somewhere between the cushion of Moore's Law and the hard place of the
exponential growth of the web. While some have questioned whether such
exponential growth is currently being maintained [1], the trend towards
automated production of web pages from databases makes it likely that such
growth will continue, or even accelerate, in the immediate future. Given that
the bandwidth for conducting crawls is neither infinite nor free, it is becoming
essential to crawl the web in a way that is not only scalable but also efficient
if some reasonable measure of quality or freshness is to be maintained.
The WebFountain crawler to be outlined in this paper is designed to crawl the
entire web repeatedly, keeping a local copy of up to 1 MB of the text of each
page, plus metadata, in a repository, which is to be used for indexing,
mining, etc. Furthermore this crawler is incremental, that is, the
repository copy of each page is updated as soon as the actual web page is
crawled. However, even using this technique, the repository must always be out
of date to some (we hope minimal) extent. Our crawler, to be described in
greater detail below, has as an essential feature a set of queues of URLs to be
crawled, and several parameters which determine how URLs are to be selected from
these queues. The values of these parameters must be determined in a way that
leads to efficient crawling, and we present a model which computes these
parameters in such a way as to optimize some objective related to freshness.
This model assumes that the web is crawled for some variable, but pre-specified
number of discretized time periods which together constitute a cycle of
the crawler.
Several definitions of freshness, which is non-trivial to measure, have been
proposed by various authors, but in this paper we take the view that a page in
the repository is either up-to-date, or obsolete, that is, it no longer
matches the page on the real web. Clearly it is desirable to minimize the number
of such pages, but we shall see that such an objective can be formulated in a
number of ways. Pages become obsolete when they change on the real web between
crawls, but for simplicity we shall also consider the special case of new pages,
of which we have no copy in the repository, but of which we become aware, either
through new links, or exogenous information, and define them as also being
obsolete.
A particular feature of the model we describe is that it makes no a
priori assumptions about the way in which pages change, simply that we can
measure when changes occur, and record the change frequency in each page's metadata.
In this sense the model is adaptive in that the more cycles the crawler
is in operation, the more reliable and refined is the data which we have
available to drive it. For our purposes a page change is required to be
non-trivial, as determined by a shingle (see Broder et al. [4]). Furthermore,
while growth in the web changes the data in our model, it has no effect on the
size of the model nor the solution time.
No previous work that we know of makes use of precisely this set of
assumptions, but much of it has some bearing on our model. After a review of
this work, we present a more detailed description of the crawler, followed by a
mathematical description of the optimization model for the model parameters. We
then describe a series of computational experiments designed to test use of the
model and some of its variants on simulated but realistic data.
2 Previous Work
There
have been several studies of web crawling in its relatively short history, but
most of them have had a focus rather different from ours. Some have concentrated
on aspects relating to caching, eg Wills and Mikhailov[13] and Douglis et al.
[9]. Others have been principally interested in the most efficient and effective
way to update a fixed size database extracted from the web, often for
some specific function, such as data mining, see eg the work of Cho et al.
[5],[6],[7]. These studies were performed over time periods ranging from a few
days to seven months. However, for differing practical reasons, these crawlers
were restricted to subsets of web pages.
Several authors, eg Coffman et al. [8] approach crawling from a theoretical
point of view, comparing it to the polling systems of queueing theory, ie
multiple-queue, single-server systems. However, the problem equivalent to the
obsolescence time of a page is unexplored in the queueing literature.
A common assumption has been that page changes are a Poisson or memoryless
process, with parameter λ as the rate of change of the pages.
Brewington and Cybenko[1] and Cho and
Garcia-Molina[6] confirm this within the limits of their data gathering. This is
somewhat undermined by another study, based on an extensive subset of the web,
by Brewington and Cybenko[2] showing that most web pages are modified during US
working hours, ie 5am to 5pm (Silicon Valley Standard Time), Monday to Friday.
Nevertheless, the widely accepted Poisson model forms the basis for a series of
studies on crawler strategies. These lead to a variety of analytical models
designed to minimize the age or maximize the freshness of a collection by
investigating:
- how often a page should be crawled
- in what order pages should be crawled
- whether a crawling strategy should be based on the importance of
pages or their rates of change
The definitions or metrics for freshness, age and importance are not
completely consistent, but in general the freshness of a page refers to
the difference between the time of crawling a page and the current time, while
the age of a page is the difference between the time when the page last
changed and the current time. Widely differing metrics have been offered for the
importance of a page, but as the total number of possible metrics is
large and the crawler in this study does not currently use any of them, no
formal definitions will be given here.
While differing in detail, the experimental results in the referenced papers
agree on general trends, and in particular that:
- the average size of individual pages is growing
- the proportion of visual and other nontextual material is growing in
comparison to text
- the number of pages has been growing exponentially (Brewington and
Cybenko[1] give two different estimates of 318 and 390 days for the web to
double in size but also say that the growth rate is slowing. However, see our
comment at the beginning of Section 1.)
- different domains have very different page change rates
- the average age and lifetimes of pages are still quite low (cf Tables 1 and 2).
While
all of these trends are important for caching, the last three are more relevant
to the study of whole web crawling to be discussed here.
2.1 Crawling Models
In a series of papers, Cho et al. ([5],[6],[7]) address a
number of issues relating to the design of efficient crawlers. In [7] they
examine different crawling strategies using the Stanford University web pages as
a particular subset of the web, and examine several scenarios with different
physical limitations on the crawling. Their approach is to visit more important
pages first, and they describe a number of possible metrics for determining
this, as well as the order in which the chosen pages will be visited. They show
that it is possible to build crawlers which can obtain a significant portion of
important pages quickly using a range of metrics. Their model appears to be most
useful when trying to crawl large portions of the web with limited resources, or
when pages need to be revisited often to detect changes.
Cho and Garcia-Molina[6] derive a series of mathematical models to determine
the optimum strategy in a number of crawling scenarios, where the repository's
extracted copy of the web is of fixed size. They examine models where all pages
are deemed to change at the same average or uniform rate and where they
change at different or non-uniform rates. For a real life crawler, the
latter is more likely to be relevant. For each of these two cases, they examine
a variety of synchronization or crawling policies. How often the repository web
copy is updated depends on the crawling capacity or bandwidth available for the
required number of pages. Within that limitation is the question of how often
each individual page should be crawled to meet a particular objective such as
maximizing freshness. Cho and Garcia-Molina examine a uniform allocation
policy, in which each page is crawled at the same rate, and a non-uniform
or proportional policy where each page is crawled with a frequency that
is proportional to the frequency with which it is changed on the web, ie pages
which are updated frequently are crawled more often than those which change only
occasionally. Finally, they examine the order in which the pages should be
crawled. They develop models of:
- fixed order where all pages are crawled repeatedly in the same
order each cycle
- random order where all pages are crawled in each cycle but in a
random order, eg by always starting with the root URL for a site and crawling
all pages linked to it
- purely random where pages are crawled on demand, which may mean
some pages are crawled frequently and others never.
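For concreteness (our shorthand, not Cho and Garcia-Molina's own notation): if the crawler has capacity to fetch f pages per unit time in total, there are n pages, and page i changes at rate λ_i, then the two allocation policies above amount to

    uniform:        crawl page i at rate  f / n
    proportional:   crawl page i at rate  f · λ_i / Σ_j λ_j

so under the proportional policy a page that changes twice as often is crawled twice as often.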
Their models show
that when pages change at a uniform rate, maximum freshness or minimum age is
obtained by crawling pages at a uniform rate and in fixed order. As we have
noted, most studies make the assumption that pages change at a variable rate
which may be approximated by a Poisson distribution, but in the second stage of
their study, Cho and Garcia-Molina assume the broader gamma distribution for
this change rate. Moreover, they prove that their particular model is valid for
any distribution, and conclude that when pages change at varying rates,
it is always better to crawl these pages at a uniform rate, ie ignoring the rate
of change, than at a rate which is proportional to the rate of change. However,
to maximize freshness they find a closed form solution to their model which
provides an optimal crawling rate which is better than the uniform rate. These
results were all derived for a batch or periodic crawler, where a
fixed number of pages is crawled in a given time period. These pages are used to
update a fixed size repository either by replacing existing repository pages
with newer versions or by replacing less important pages with those deemed to be
of greater importance.
Coffman et al. [8] built a theoretical model to minimize the fraction of time
pages spend out of date. Also assuming Poisson page change processes, and a
general distribution for page access time, they similarly show that optimal
results can be obtained by crawling pages as uniformly as possible.
In [5], Cho and Garcia-Molina devise an architecture for an incremental
crawler, and examine the use of an incremental versus a batch crawler under
various conditions, particularly those where the entire web is not crawled.
Crawling a subset (720,000 pages from 270 sites) of the web daily, they
determined statistics on the rate of change of pages from different domains.
They found, for example, that for the sites in their survey, 40% of pages in the
.com domain change daily in contrast to .edu and .gov domains where more than
50% of pages did not change in the four months of the study. They show that the
rates of change of pages they crawled can be approximated by a Poisson
distribution, with the proviso that the figures for pages which change more
often than daily or less often than every four months are inaccurate. Using different
collection procedures, Wills and Mikhailov[13] derive similar conclusions.
A disadvantage of all these models is that they deal only with a fixed size
repository of a limited subset of the web. In contrast, our model is flexible,
adaptive, based upon the whole web and caters gracefully for its growth.
2.2 Page Statistics Derived from
Crawling
Statistics on page ages, lifetimes,
rates of change, etc are important for our model. Subject to the assumptions of
a constant size repository copy of the web, which is updated with periodic
uniform reindexing, Brewington and Cybenko[1] showed that in order to be sure
that a randomly chosen page is at least 95% fresh or current up to a day ago,
the web (of 800M pages) needs a reindexing period of 8.5 days, and a reindexing
period of 18 days is needed to be 95% sure that the repository copy of a random
page was current up to 1 week ago. In another study[2], the same authors
estimate the cumulative probability function for the page age in days on a log
scale as shown in Table 1.
  cumulative probability    page age (days)
  0.03                      10^0
  0.14                      10^1
  0.48                      10^2
  0.98                      10^3

Table 1: Cumulative Probability Distribution of Page Age in Days
As with similar studies, this does not accurately account for pages which
change very frequently, or those which change very slowly. Allowing for these
biases, the authors also estimate the cumulative probability of mean lifetime in
days, shown in Table 2:
  cumulative probability    mean lifetime (days)
  0.0                       10^0
  0.12                      10^1
  0.40                      10^2
  1.00                      0.6×10^3

Table 2: Cumulative Probability Distribution of Mean Lifetime in Days
Brewington and Cybenko then use their data to examine various reindexing
strategies based on a single revisit period for all pages, and refer to the need
for mathematical optimization to determine the optimal reindexing
strategy, when the reindexing period varies per page. Such a model is precisely
the main focus of this paper.
3 The WebFountain Crawler
The model in this paper is designed to address the efficiency of
a crawler recently built at IBM as part of the WebFountain data mining
architecture. The features of this crawler that distinguish it from most
previous crawlers are that it is fully distributed and
incremental. By distributed, we mean that the responsibility for
scheduling, fetching, parsing and storing is distributed among a homogeneous
cluster of machines. URLs are grouped by site, and a site is assigned to a
machine in the cluster (a few very large sites such as geocities may actually be
split among several machines). There is no global scheduler, nor are there any
global queues to be maintained. Moreover, there is no machine with access to a
global list of URLs.
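As an illustration of this grouping of work by site (a minimal sketch under our own assumptions; siteOf and SiteRegistry are hypothetical names, and the real Controller maintains its site assignments dynamically rather than by the simple least-loaded rule used here):

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Extract the site (host) portion of a URL,
    // e.g. "http://example.com/a/b" -> "example.com". Hypothetical helper.
    std::string siteOf(const std::string& url) {
        std::string::size_type start = url.find("://");
        start = (start == std::string::npos) ? 0 : start + 3;
        std::string::size_type end = url.find('/', start);
        return url.substr(start,
                          end == std::string::npos ? std::string::npos : end - start);
    }

    // A controller-style registry that assigns each newly seen site to a machine.
    // Here we simply balance by the number of sites already assigned.
    class SiteRegistry {
    public:
        explicit SiteRegistry(int machines) : load_(machines, 0) {}

        int machineFor(const std::string& site) {
            std::map<std::string, int>::iterator it = assignment_.find(site);
            if (it != assignment_.end()) return it->second;
            int best = 0;  // assign the site to the currently least-loaded machine
            for (int m = 1; m < (int)load_.size(); ++m)
                if (load_[m] < load_[best]) best = m;
            assignment_[site] = best;
            ++load_[best];
            return best;
        }

    private:
        std::map<std::string, int> assignment_;
        std::vector<int> load_;
    };

    int main() {
        SiteRegistry registry(4);  // a 4-machine cluster, for illustration
        for (const std::string& url : {"http://example.com/a", "http://example.org/b",
                                       "http://example.com/c"}) {
            std::cout << siteOf(url) << " -> machine "
                      << registry.machineFor(siteOf(url)) << "\n";
        }
    }

All URLs from the same site therefore route to the same machine, so per-site data structures and politeness state never need to be shared across the cluster.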
An incremental crawler (as opposed to a batch crawler) is
one that runs as an ongoing process, and the crawl is never regarded as
complete. The underlying philosophy is that the local collection of documents
will always grow, always be dynamic, and should be constructed with the goal of
keeping the repository as fresh and complete as possible. Instead of devoting
all of its effort to crawling newly discovered pages, a percentage of its time
is devoted to recrawling pages that were crawled in the past, in order to
minimize the number of obsolete pages. Note that our use of the term
"incremental" differs from that of Cho et al. [5],[6],[7]. Their definition
assumes that the document collection is of static size, and a ranking function
is used to replace documents in the collection with more important documents. We
regard the issue of incrementality to be independent of the size of the
collection, and we allow for growth of the crawler, in order to meet the demands
of an ever-expanding web.
The WebFountain crawler is written in C++, is fully distributed, and uses MPI
(Message Passing Interface) for communication between the different components.
The three major components are the Ants, which are the machines assigned
to crawl sites, duplicate detectors, which are responsible for detecting
duplicates or near-duplicates, and a single machine called a Controller.
The Controller is the control point for the machine cluster, and keeps a dynamic
list of site assignments on the Ants. It is also responsible for routing
messages for discovered URLs, and manages the overall crawl rate, monitoring of
disk space, load balancing, etc.
Other crawlers have been written that distribute the load of crawling across
a cluster, but they generally distribute the work in different ways. Due to the
competitive nature of the Internet indexing and searching business, few details
are available about the latest generation of crawlers. The first generation
Google crawler [3] is apparently designed as a batch crawler, and is only
partially distributed. It uses a single point of control for scheduling of URLs
to be crawled. While this might appear convenient, it also provides a bottleneck
for intelligent scheduling algorithms, since the scheduling of URLs to be
crawled may potentially need to touch a large amount of data (eg, robots.txt,
politeness values, change rate data, DNS records, etc). Mercator[10] supports
incremental crawling by using priority values on URLs and by interleaving the
crawling of new and old URLs.
The crawl scheduling mechanism of the WebFountain crawler resembles Mercator
in that it is fully distributed, very flexible, and can even be changed on the
fly. This enables efficient use of all crawling processors and their underlying
network. The base software component for determining the ordering on URLs to be
crawled consists of a composition of sequencers. Sequencers are
software objects that implement a few simple methods to determine the current
backlog, whether there are any URLs available to be crawled, and control of
loading and flushing data structures to disk. Sequencers are then implemented
according to different policies, including a simple FIFO queue or a priority
queue. Other Sequencers are combiners, and implement a policy for joining
sequencers. Examples include a round robin aggregator, or a priority aggregator
that probabilistically selects from among several sequencers according to some
weights. In addition, we use the Sequencer mechanism to implement the crawling
politeness policy for a site. The ability to combine sequencers and cascade them
provides a very convenient means to build a flexible recrawl strategy.
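To make the composition idea concrete, here is a minimal sketch in C++ (illustrative class names of our own, not the actual WebFountain interfaces) of a Sequencer, a FIFO policy, and a weighted aggregator that probabilistically selects among child sequencers:

    #include <cstddef>
    #include <cstdlib>
    #include <deque>
    #include <memory>
    #include <string>
    #include <vector>

    // Minimal Sequencer interface: report the backlog, say whether a URL is
    // currently available, and hand out the next URL. (Loading and flushing of
    // data structures to disk is omitted from this sketch.)
    class Sequencer {
    public:
        virtual ~Sequencer() {}
        virtual std::size_t backlog() const = 0;
        virtual bool hasNext() const = 0;
        virtual std::string next() = 0;
    };

    // A simple FIFO policy.
    class FifoSequencer : public Sequencer {
    public:
        void push(const std::string& url) { queue_.push_back(url); }
        std::size_t backlog() const override { return queue_.size(); }
        bool hasNext() const override { return !queue_.empty(); }
        std::string next() override {
            std::string url = queue_.front();
            queue_.pop_front();
            return url;
        }
    private:
        std::deque<std::string> queue_;
    };

    // A combiner that probabilistically selects among child sequencers in
    // proportion to their weights, e.g. weight p for a "recrawl" queue and
    // 1 - p for a "never crawled" queue.
    class WeightedAggregator : public Sequencer {
    public:
        void add(std::shared_ptr<Sequencer> child, double weight) {
            children_.push_back(child);
            weights_.push_back(weight);
        }
        std::size_t backlog() const override {
            std::size_t total = 0;
            for (const auto& c : children_) total += c->backlog();
            return total;
        }
        bool hasNext() const override {
            for (const auto& c : children_) if (c->hasNext()) return true;
            return false;
        }
        std::string next() override {
            double total = 0;
            Sequencer* fallback = nullptr;
            for (std::size_t i = 0; i < children_.size(); ++i)
                if (children_[i]->hasNext()) {
                    total += weights_[i];
                    fallback = children_[i].get();
                }
            if (!fallback) return std::string();   // nothing available
            double r = total * static_cast<double>(std::rand()) / RAND_MAX;
            for (std::size_t i = 0; i < children_.size(); ++i) {
                if (!children_[i]->hasNext()) continue;
                if (r <= weights_[i]) return children_[i]->next();
                r -= weights_[i];
            }
            return fallback->next();               // guard against rounding
        }
    private:
        std::vector<std::shared_ptr<Sequencer>> children_;
        std::vector<double> weights_;
    };

A per-site crawl stream could then be assembled by placing never-crawled URLs in one FifoSequencer, recrawl candidates in another, and combining the two with a WeightedAggregator whose weights come from the optimization model described in Section 4.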
The strategy that we decided on for implementing the crawl strategy is
illustrated in Figure 1.
Figure 1: The Structure of Queues that Feed the URL Stream in
the WebFountain Crawler
At the top level, there are two queues,
one for immediate crawling that is intended to be used from a GUI, and one that
aggregates all other URLs. Under that, each Ant is assigned a list of sites to
be crawled, and maintains an active list of approximately 1000 sites that are
currently being crawled. The selection of URLs to be crawled is taken from this
active list in a round robin fashion. This avoids crawling any particular site
too frequently - the so called politeness criterion. In addition, each Ant is
multithreaded to minimize latency effects. When a site is added to the active
list, a sequencer is constructed that loads all data structures from disk,
merges the list of newly discovered URLs with the list of previously crawled
URLs for that site, and prepares two queues. One of these queues contains URLs
that have never been crawled, and one contains URLs that are scheduled for a
recrawl. It is the way in which we manage our never-crawled and recrawl lists
that is to be determined by our optimization model.
4 Model Description
We constructed this model
for robustness under both growth of the web and changes to its underlying
nature. Hence we do not make any a priori assumptions about the distribution of
page change rates. However, we do make an assumption that particular historical
information on each page is maintained in the metadata for that page.
Specifically, each time a page is crawled, we record whether that page has
changed since it was last crawled, and use this information to put the page into
one of (at most 256) change-frequency "buckets", recorded as one byte of the
page's metadata. Clearly, such data becomes more reliable as the page ages. In
practice, page change rates may range from as often as every 15 minutes to as
infrequently as once a year or less (eg a government department contact page).
The very rapidly changing pages (several times a day) are almost all media sites
(CNN, for example) and we feel that this relatively small and specialized set of
sites should be handled separately. The remaining pages in the repository are
grouped into B buckets, each containing b_i pages which have similar rates of
change.
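A minimal sketch of this bookkeeping (our own illustration; the actual metadata layout, smoothing rule and bucket boundaries are not specified in this paper) might keep a running change-rate estimate per page and fold it into one of 256 one-byte buckets spanning roughly one change per year up to one change every 15 minutes:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Per-page crawl history kept in the page's metadata (sketch).
    struct PageMeta {
        double  estChangesPerDay;  // running estimate of the page's change rate
        uint8_t bucket;            // one-byte change-frequency bucket (0..255)
    };

    // Exponentially weighted update of the change-rate estimate after a crawl.
    // 'changed' is the result of the shingle comparison against the stored copy,
    // 'daysSinceLastCrawl' the elapsed time. alpha controls how quickly old
    // observations are forgotten; 0.3 is an arbitrary illustrative choice.
    void recordCrawl(PageMeta& meta, bool changed, double daysSinceLastCrawl,
                     double alpha = 0.3) {
        double observedRate =
            (changed ? 1.0 : 0.0) / std::max(daysSinceLastCrawl, 1e-3);
        meta.estChangesPerDay =
            alpha * observedRate + (1.0 - alpha) * meta.estChangesPerDay;

        // Map the estimate onto a logarithmic scale from ~1 change per year
        // (bucket 0) up to ~96 changes per day, i.e. every 15 minutes (bucket 255).
        const double lo = 1.0 / 365.0, hi = 96.0;
        double clamped = std::min(std::max(meta.estChangesPerDay, lo), hi);
        double frac = std::log(clamped / lo) / std::log(hi / lo);
        meta.bucket = static_cast<uint8_t>(frac * 255.0 + 0.5);
    }

As the paper notes, such an estimate becomes more reliable the longer the page has been in the repository and the more often it has been crawled.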
A (theoretically) complete crawl of the web is modeled as taking place in a
crawler cycle. Both the number of time periods in such a cycle, T,
and the (equal) length of each period may be varied as needed. Our fundamental
requirement is to estimate the number of obsolete pages in each frequency bucket
at the end of each time period. The way in which this is calculated is best
illustrated by following the tree of alternatives in Figure 2.
We consider a generic time period and a generic bucket (dropping the
subscript) of pages in the repository, containing b pages at the
beginning of the time period. Let the number of such pages for which our
repository copy is already obsolete at the beginning of the time period be
y, leaving (b-y) up-to-date. Now let x be the number of
pages in this bucket crawled during the time period, and assume that obsolete
and up-to-date pages are crawled with equal probability. This gives the
partition of pages in the bucket seen at the second level of branches in the
tree. Finally, let a be the (given) proportion of pages which
change in the bucket during the time period, and assume that such a change is
detected if the page is crawled in the same time period. The b pages in
the bucket are now partitioned among the leaves of the tree, and we easily see
that the leaves marked * correspond to obsolete
pages. Note that by adding the expressions attached to these two leaves we
obtain an expression for the number of obsolete pages at the end of this
time period, as a function of the data at the beginning of the period and of the
decision variable x, which specifies how many pages in this bucket to
crawl. This relationship is fundamental to our model, and to the way in which
the queue of "old" URLs is managed.
Figure 2: Identification of Obsolete Pages in each Bucket per
Time Period
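Concretely (restating the two starred leaves in the notation above, with the bucket subscript dropped), the number of obsolete pages at the end of the period is

    y' = (1 − x/b)·y + (1 − x/b)·a·(b − y)
       = (1 − x/b)·[(1 − a)·y + a·b]

where the first term counts uncrawled pages that were already obsolete and the second counts uncrawled pages that were up-to-date but changed during the period; with the subscripts restored, this is balance equation (2) in the formal model below.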
In addition to the "old" pages we must deal with the "new" pages either
discovered or exogenously supplied during the crawl. Let the number of these new
pages crawled during a time period be z (these are then added to the
appropriate buckets in the repository). The remaining new uncrawled pages are
represented by n. These pages are also regarded as obsolete and will
remain in that state until crawled.
We are now ready to give the definitions of the variables and the formal
model.
T = number of time periods in the model
B = number of buckets, where a bucket refers to pages which change at approximately the same rate
cconst_it = average time in seconds to crawl an old page in bucket i in time period t
dconst_t = average time in seconds to crawl a new page in time period t
Cconst_t = total number of seconds available for crawling in time period t
oldwt_it = experimental proportional weight on crawling obsolete pages in bucket i in time period t
newwt_t = experimental proportional weight on crawling new pages in time period t
oldnwt_it = probability that when an old page in bucket i is crawled in time period t, it finds a new page
newnwt_t = probability that when a new uncrawled page is crawled in time period t, it finds a new page
nconst = minimum number of new pages brought to the attention of the crawler per time period
b_it = number of pages in bucket i at the end of time period t
p_i = distribution of new pages to buckets, where p_i is proportional to b_i and Σ_{i=1}^{B} p_i = 1
a_it = fraction of pages in bucket i which change in time period t
y_it = number of obsolete pages in bucket i at end of time period t
x_it = number of crawled pages in bucket i in time period t
n_t = number of new uncrawled pages at end of time period t
z_t = number of new pages crawled in time period t
The objective of the time period model is to minimize the weighted sum of
obsolete pages, ie:
    minimize   Σ_{t=1}^{T} ( Σ_{i=1}^{B} oldwt_it · y_it  +  newwt_t · n_t )
subject to the following constraints:
The bandwidth available for crawling during each time period may not be exceeded:

    Σ_{i=1}^{B} cconst_it · x_it + dconst_t · z_t ≤ Cconst_t    (1)

The number of obsolete existing or "old" pages, y_it, is updated as discussed above at every time period:

    y_it = (1 − x_it / b_{i,t−1}) · ((1 − a_it) · y_{i,t−1} + a_it · b_{i,t−1})    (2)

The number of new uncrawled pages, n_t, is updated every time period:

    n_t = n_{t−1} + Σ_{i=1}^{B} oldnwt_it · x_it + (newnwt_t − 1) · z_t + nconst    (3)

The total number of pages in each bucket, b_it, is updated in every time period:

    b_it = b_{i,t−1} + p_i · z_t    (4)

The number of existing pages, x_it, crawled in any bucket in any time period must not exceed the total number of pages available to be crawled in the bucket, b_{i,t−1}:

    x_it ≤ b_{i,t−1}    (5)

Similarly, in any time period, the number of new pages crawled, z_t, must not exceed the number of new uncrawled pages, n_{t−1}:

    z_t ≤ n_{t−1}    (6)

Each constraint holds for all t = 1, ..., T and, where relevant, for all i = 1, ..., B.
The critical solution outputs are the ratios x_it / b_it and the z_t values, which tell us how
many URLs in the "old" and "new" queues we should crawl in each time period, and
hence the probabilities p in Figure 1.
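To make the recursions concrete, the following sketch (illustrative only; the scalar treatment of oldnwt and newnwt, and all names, are our simplifications) steps the buckets forward through one time period for candidate decisions x and z; summing the resulting y values together with n gives the obsolete-page count that the objective weights:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // One bucket's state at the start of a time period (values are placeholders).
    struct Bucket {
        double b;  // pages in the bucket
        double y;  // of which obsolete
        double a;  // fraction of the bucket that changes per time period
        double p;  // share of newly crawled pages routed to this bucket
    };

    // Advance the buckets and the new-page pool by one time period, applying the
    // balance equations (2)-(4). x[i] = old pages crawled in bucket i, z = new
    // pages crawled, n = new uncrawled pages, nconst = new pages arriving per
    // period. For brevity, the link-discovery probabilities oldnwt and newnwt
    // are treated as scalars rather than per-bucket, per-period values.
    void stepPeriod(std::vector<Bucket>& buckets, const std::vector<double>& x,
                    double z, double& n, double nconst,
                    double oldnwt, double newnwt) {
        z = std::min(z, n);                         // constraint (6)
        double discovered = 0.0;
        for (std::size_t i = 0; i < buckets.size(); ++i) {
            Bucket& bk = buckets[i];
            if (bk.b <= 0.0) continue;
            double xi = std::min(x[i], bk.b);       // constraint (5)
            // Balance equation (2): obsolete pages left at the end of the period.
            bk.y = (1.0 - xi / bk.b) * ((1.0 - bk.a) * bk.y + bk.a * bk.b);
            discovered += oldnwt * xi;              // new URLs found on old pages
            bk.b += bk.p * z;                       // balance equation (4)
        }
        n = n + discovered + (newnwt - 1.0) * z + nconst;   // balance equation (3)
    }
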
5 Solution
Since the balance equation (2) for updating
y_it is highly nonlinear, the model must be solved by a
nonlinear programming (NLP) method capable of solving large scale problems. For
our model runs, it was assumed that there are 500M unique pages on the web, with
a growth rate which allows the web to double in size in approximately 400 days.
With a value of 14 for the number of time periods T, and 255 for
B, the number of buckets, the basic model has approximately 11,200
variables, of which 10,000 are nonlinear, and 11,200 constraints, of which about
3,300 are nonlinear. Even after the use of various presolve techniques to reduce
the size of the problem, solution proved to be non-trivial.
We used the NEOS[12] public
server system to run experiments with several different NLP solvers on the
models and found that the standard NLP package MINOS[11] gave the
best and most reliable results. Solution times for all variations of the model
were around ten minutes on a timeshared machine.
Since the WebFountain crawler for which this model was designed is in its
early stages of development, we have very little actual historical data for such
parameters as the rate of change of various pages. We have therefore used
simulated, but realistic data, based on the previous studies we have cited from
the literature.
5.1 Results
Since our model runs use
estimated data, many experiments were run for a range of values of the critical
parameters. Reassuringly, the model is quite robust under most of these changes.
However, it turns out to be quite sensitive to changes in the weights in the
objective function, oldwt_it and newwt_t. Given the overall aim of minimising the
total number of obsolete pages, we describe the implementation of three of the
many possible variations of the objective function weights:
- Strategy 1 gives equal weights (summing to 1) to each time period in a
cycle of the crawler
- Strategy 2 gives the last time period the total weight of 1 and the other
time periods, zero weight, ie the model is minimising the number of obsolete
pages just on the last time period of a crawler cycle
- Strategy 3 gives the last time period a high weighting and the remaining
time periods very low weightings, ie it is still trying to minimise the number
of obsolete pages in the last time period but it is also taking into account
the obsolete pages in all time periods.
                objective      Total Pages at end    Total Obsolete Pages at end
  Strategy 1    32.592×10^7    5.224×10^8            3.086×10^7
  Strategy 2    1.722×10^7     5.184×10^8            1.609×10^7
  Strategy 3    2.138×10^7     5.224×10^8            1.702×10^7

Table 3: Model Statistics
5.2 Experiment 1
As shown in Table 3,
Strategy 2 gives the minimum value of the objective function, ie the weighted
total of obsolete pages at the end of a crawler cycle, followed by Strategy 3.
The objective value for Strategy 1 is considerably higher. Figure 3 illustrates
the effects of implementing these strategies.
Strategy 2 recommends crawling each page just once during a cycle of the
crawler. This uniform crawling is in line with the theoretical results of Cho
and Garcia-Molina [6] and Coffman et al. [8]. Strategy 1 recommends crawling
fast changing pages in many time periods of a cycle. For these pages, the crawl
rate is usually higher even than the page change rate. However, it does not
crawl at all those pages which fall into the lowest 40% of page change rates.
Strategy 3 is a compromise. It recommends crawling all pages at least once in a
crawler cycle with the fastest changing 18% being crawled more than once per
cycle.
Figure 3: Recommended Number of Crawls per Crawler Cycle for
Pages with Different Change Rates under Different Crawl
Strategies
Table 3 also
shows that while the total number of web pages in the repository at the end of a
crawler cycle is similar under each strategy, the total number of obsolete pages
is not. Figure 4 examines the total number of obsolete pages in each time period
of a cycle under each strategy.
Figure 4: Total Number of Obsolete Pages each Time Period
under Different Strategies
If we disregard the actual objective
function and look at the number of obsolete pages we see that in any given time
period (except the last), Strategy 1 always has fewer obsolete pages than
Strategy 3 and considerably fewer than Strategy 2. Because of the weights in the
objective functions for the different strategies, the lower number of obsolete
pages for Strategies 2 and 3 in the last time period is expected.
Mathematically, Strategy 2 is optimal. However, depending on the purpose(s)
for which a crawler is designed, it may not be acceptable to have an incremental
crawler ignore the page change rate and crawl all pages at a uniform rate, ie
just once each crawler cycle. In terms of minimizing the number of obsolete
pages each time period, Strategy 1 is clearly superior. However, it does not
crawl 40% of the pages during a cycle. This may also be unacceptable for many of
the possible uses of the crawler.
Strategy 3 again provides a reasonable compromise. It crawls all pages within
a cycle and still has a somewhat lower mathematical minimum objective value than
Strategy 1.
5.3 Experiment 2
Many versions of the model
were run changing the values of different parameters. In all cases the results
were as expected. Experiment 2 is a typical example. In the initial or Version 1
of the model, the 500M repository pages were assumed to be distributed equally
to each bucket, ie it was assumed that there are the same number of pages
corresponding to each of the different page change rates. Each of the 255
buckets received 2M pages. In Version 2, the buckets representing the 25% of the
fastest changing page rates and the 25% of the slowest changing pages all
received 3M pages initially. The buckets representing the middle 50% of page
change rates each received 1M pages initially. Table 4 shows the effect on the
total number of obsolete pages of this change.
                   objective      Total Pages at end    Total Obsolete Pages at end
  Strategy 1 v1    32.592×10^7    5.224×10^8            3.086×10^7
  Strategy 1 v2    37.744×10^7    5.214×10^8            3.534×10^7
  Strategy 3 v1    2.138×10^7     5.224×10^8            1.702×10^7
  Strategy 3 v2    2.500×10^7     5.193×10^8            1.972×10^7

Table 4: Model Statistics for Versions 1 and 2 of Strategies 1 and 3
Figure 5: Effects on the Number of Obsolete Pages per Time
Period of Varying the Page Change Rates Distribution
Figure 5 shows that this change made little difference to the distribution of
obsolete pages for each time period. For both Strategy 1 and Strategy 3, there
are a larger number of obsolete pages in Version 2. This is expected as the
models in Version 2 started with a higher number of obsolete pages and given the
finite capacity of the crawler, this number will grow during a 14 time period
cycle of the crawler unless forced to reduce as in the last few time periods of
Strategy 3.
Experiment 2 does show the robustness of the model to changes in the initial
parameters.
5.4 Experiment 3
The object of Experiment 3
was to determine if the number of obsolete pages continued to grow with time, or
if this number reached stabilisation. There were four runs, all of the model
with an objective corresponding to Strategy 1. Figure 6 illustrates the results.
Strategy 1 (Version 1) was run for 14 time periods and for 28 time periods, as
was Version 3. In Version 3, it was assumed that the page change rates were half
as frequent as in Version 1. The distribution of obsolete pages over time
periods between and within each version shows the expected similarities. As can
be seen in any given time period, the number of obsolete pages for Version 3 is
approximately half that of Version 1. More importantly, it can be seen that for
both versions, the number of obsolete pages is tending to stabilise. It was not
possible to run the crawler model for a longer period and obtain a useful
mathematical solution, nor would the crawler be run for this long in practice
without an update of the parameters and reoptimisation.
Figure 6: Trend towards Stabilisation in the Number of
Obsolete Pages over Time
The objectives we used, based on the different weighted sums of obsolete
pages, correspond to maximising the freshness of the collection under different
crawler aims, eg all the web must be crawled each cycle or a certain percentage
of pages in the repository should be guaranteed to be no more than say, a week
old.
Taking all the experiments into consideration, the results are consistent
with an implementation philosophy of using Strategy 2 in early cycles of the
crawler, to drive down the number of obsolete pages in the repository quickly.
It would then be beneficial to switch to Strategy 1 or 3 to maintain a stable
number.
6 Conclusions
The computational results we
have obtained (albeit with simulated data) suggest that an efficient crawling
strategy can be implemented for the WebFountain (or any) incremental crawler
without making any theoretical assumptions about the rate of change of pages,
but by using information gleaned from actual cycles of the crawler which
adaptively build up more extensive and reliable data. Thus the model we have
described is adaptive at two levels: within a crawler cycle it coordinates the
management of the URL queues over the cycle's component time periods, and
between cycles the data necessary for the optimization is updated for the next
cycle - particularly the change rates, and new page creation rates. We look
forward to full scale implementation of the model when the WebFountain crawler
begins regular operation.
7 Acknowledgements
The authors would like to
thank members of IBM's WebFountain team, in particular, Sridhar Rajagopalan and
Andrew Tomkins. They would also like to thank Michael Saunders of Stanford
University and the NEOS team for help with the solution of the nonlinear
programming problems and Paula Wirth for technical assistance.
References
- Brewington, B. & Cybenko, G. (2000a) How Dynamic is the Web?,
Proceedings of the 9th World Wide Web Conference (WWW9)
- Brewington, B. & Cybenko, G. (2000b) Keeping Up with the Changing
Web, Computer, May, pp 52-58
- Brin, S. & Page, L. (1998) The Anatomy of a Large-Scale
Hypertextual Web Search Engine, Proceedings of the 7th World Wide Web
Conference, (WWW7), http://www-db.stanford.edu/~backrub/google.html
- Broder, A.Z., Glassman, S.C., Manasse, M.S. & Zweig, G. (1997)
Syntactic Clustering of the Web, Proceedings of 6th International World
Wide Web Conference (WWW6).
http://www.scope.gmd.de/info/www6/technical/paper205/paper205.html
- Cho, J. & Garcia-Molina, H. (2000b) The Evolution of the Web and
Implications for an Incremental Crawler, Proceedings of 26th International
Conference on Very Large Databases (VLDB)
- Cho, J. & Garcia-Molina, H. (2000a) Synchronizing a Database to
Improve Freshness, Proceedings of 2000 ACM International Conference on
Management of Data (SIGMOD)
- Cho, J., Garcia-Molina, H. & Page, L. (1998) Efficient Crawling
Through URL Ordering, Proceedings of the 7th World Wide Web Conference
(WWW7)
- Coffman, E., Liu, Z. & Weber, R. (1997) Optimal Robot Scheduling
for Web Search Engines, Rapport de recherche no 3317, INRIA Sophia
Antipolis
- Douglis, F., Feldmann, A. & Krishnamurthy, B. (1997) Rate of Change
and other Metrics: a Live Study of the World Wide Web, Proceedings of
USENIX Symposium on Internetworking Technologies and Systems
- Heydon, A. & Najork, M. (1999) Mercator: A Scalable, Extensible Web
Crawler, World Wide Web, 2(4), 219-229
- MINOS (http://www.sbsi-sol-optimize.com/Minos.htm)
- NEOS Server for Optimization (http://www-neos.mcs.anl.gov)
- Wills, C. & Mikhailov, M. (1999) Towards a Better Understanding of
Web Resources and Server Responses for Improved Caching, Proceedings of
the 8th World Wide Web Conference (WWW8)
Vitae
Jenny Edwards received her PhD in Computer Science from the University
of Sydney. Her research interests include mathematical programming and
algorithms for parallel processing. She has been an academic in the
Faculty Of Information Technology at the University of Technology, Sydney
since 1976. She is a member of ACS, INFORMS and the Mathematical
Programming Society. She is currently on a year's leave, some of which is
being spent at IBM Almaden Research Center where this work was completed.
Kevin S. McCurley joined IBM Almaden Research Center in
1987. He received a Ph.D. in Mathematics from the University of
Illinois, and has previously held positions at Michigan State University,
University of Southern California, and Sandia National Laboratories. His
current research interests include information security and web
technologies.
John Tomlin gained his B.Sc.(Hons) and Ph.D. in mathematics
at the University of Adelaide, South Australia. His current interests are
in large scale optimization, sparse matrices and the World Wide Web. He
joined the IBM Research Division in 1987. He is a member of ACM, INFORMS
and the Mathematical Programming Society.
Footnotes:
1 This work was completed while the author was on leave at
IBM Almaden Research Center